Run Modin on cluster #46

pschafhalter · 2018-07-19T00:19:18Z

This PR lays the groundwork for using Modin on clusters. It exposes a primitive method for running Modin on a cluster using a Jupyter notebook.

Example use

modin notebook --config=config.yaml --port=8889
This will launch a Modin cluster configured by config.yaml with the jupyter notebook accessible at localhost:8889.

See example_config.yaml for information on the config file.
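
As a rough illustration of the command line shown above, the sketch below parses the same notebook subcommand with its --config and --port options using argparse. The parser layout, the function name build_parser, and the default port are assumptions made for illustration; the PR's actual entry point may be implemented differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring: modin notebook --config=... --port=..."""
    parser = argparse.ArgumentParser(prog="modin")
    subparsers = parser.add_subparsers(dest="command")
    notebook = subparsers.add_parser(
        "notebook", help="launch a notebook on a cluster"
    )
    notebook.add_argument(
        "--config", required=True, help="path to the cluster config file"
    )
    # Assumed default, for illustration only; the PR's real default is not shown.
    notebook.add_argument(
        "--port", type=int, default=8889, help="local port for the notebook"
    )
    return parser


# Parse the exact arguments from the example use above.
args = build_parser().parse_args(
    ["notebook", "--config=config.yaml", "--port=8889"]
)
print(args.command, args.config, args.port)
```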

What do these changes do?

Add scripts to set up Ray cluster
Add script to launch jupyter notebook on driver
Add modin console command
SSH forwarding from driver
Documentation

Future work for follow-up PRs

Usability improvements
- Insert ray.init(redis_address=...) into the notebook, or override Ray initialization on import modin
- Open notebook in browser
Improved error handling
- Detecting whether port specified for forwarding is open
- Better notifications for errors in nodes
passes git diff upstream/master -u -- "*.py" | flake8 --diff
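
The port check listed under future work could look roughly like the sketch below. It assumes that "open" means "free to bind on the local machine"; the helper name is_port_free is hypothetical and is not part of this PR.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, most likely because the port is already in use.
            return False
    return True
```

A check like this is inherently racy: another process can take the port between the check and the actual forwarding, so it is best treated as an early warning rather than a guarantee.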

kunalgosar · 2018-07-19T03:12:55Z

I don't think we should merge this until at least the Documentation is ready. Potentially we can all collaborate on this branch in the meantime?

pschafhalter · 2018-07-19T07:01:10Z

I'll add documentation tomorrow.

devin-petersohn

Left a few comments. Overall looks good!

devin-petersohn · 2018-07-19T14:24:03Z

modin/scripts/configure_head_node.sh

+KEY=$2
+
+ssh -i $2 -o "StrictHostKeyChecking no" $1 << "ENDSSH"
+pip3 install modin jupyter


We might want to think about this line for the final implementation. Ideally, we would let the user specify the python they want to use (or somehow require it on the $PATH).

For now, I would suggest just doing python -m pip so it installs for the python on the $PATH. How does this affect ray start and ray stop?

I'll add this fix. It shouldn't affect ray start and ray stop unless the user manually installs Ray with another version of python.
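
A minimal Python sketch of the python -m pip idea: building the install command from a specific interpreter ties pip to that interpreter instead of whichever pip3 happens to be first on the $PATH. The helper name pip_install_command is hypothetical.

```python
import sys


def pip_install_command(packages, python=None):
    """Build a pip command tied to one specific Python interpreter."""
    # Fall back to the interpreter running this script if none is given.
    interpreter = python or sys.executable
    return [interpreter, "-m", "pip", "install"] + list(packages)


print(pip_install_command(["modin", "jupyter"], python="python"))
```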

devin-petersohn · 2018-07-19T14:49:31Z

modin/scripts/scripts.py

+    redis_address = cluster.setup_cluster(config)
+    print("\nLaunching notebook\n")
+    print("*" * 68)
+    print(("To connect to the cluster, run the following commands in the "


Ideally this would go in our __init__ so we should be able to detect if the cli was used.

Agreed, I'm still thinking about how to best set this up. Do you think setting an environment variable with the redis address and detecting that in the __init__ is a good solution?

That would probably be good, but we probably should use two environment variables, one for MODIN_EXECUTION_FRAMEWORK and one for MODIN_RAY_REDIS_ADDRESS or something like that.
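
A sketch of how the two proposed environment variables might be read, assuming the names suggested above. The helper redis_address_from_env and the example address are hypothetical; the function only decides which address to use and leaves the call to ray.init to the caller.

```python
import os


def redis_address_from_env(environ=None):
    """Return the Redis address if the environment selects Ray, else None."""
    environ = os.environ if environ is None else environ
    if environ.get("MODIN_EXECUTION_FRAMEWORK") != "ray":
        return None
    # Treat an empty value the same as an unset variable.
    return environ.get("MODIN_RAY_REDIS_ADDRESS") or None


example_env = {
    "MODIN_EXECUTION_FRAMEWORK": "ray",
    "MODIN_RAY_REDIS_ADDRESS": "10.0.0.1:6379",  # made-up address
}
print(redis_address_from_env(example_env))
```

Taking the environment as a parameter keeps the lookup easy to test without mutating os.environ.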

devin-petersohn · 2018-07-19T14:50:25Z

modin/scripts/cluster.py

+def setup_cluster(config):
+    """Sets up a cluster given a valid configuration"""
+    if config["execution_engine"] != "ray":
+        raise ValueError("Only Ray clusters supported for now")


Prefer NotImplementedError
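
The suggestion amounts to something like the sketch below. The helper name require_ray_engine is hypothetical and stands in for the check at the top of setup_cluster. NotImplementedError signals that other engines are valid in principle but not yet supported, whereas ValueError suggests the input itself is malformed.

```python
def require_ray_engine(config):
    """Raise if the config asks for an engine other than Ray."""
    if config["execution_engine"] != "ray":
        raise NotImplementedError("Only Ray clusters supported for now")


require_ray_engine({"execution_engine": "ray"})  # passes silently
```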

robertnishihara · 2018-07-24T07:07:35Z

We've tried to make it super easy to start/manage a Ray cluster with the documentation in http://ray.readthedocs.io/en/latest/using-ray-on-a-large-cluster.html (for the existing cluster case), and http://ray.readthedocs.io/en/latest/autoscaling.html (for AWS/GCP).

This PR is targeting the existing cluster case. If you find configure_ray_cluster.sh easier to use than what we currently have in our documentation, we should add it to the Ray repository (it would be useful for any Ray user, right?).

robertnishihara · 2018-07-24T07:08:23Z

modin/experimental/scripts/cluster.py

+
+def check_required(config, schema):
+    """Check required config entries"""
+    if type(config) is not dict and type(config) is not list:


It's more canonical to use isinstance
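
For illustration, the isinstance form of the check might look like this. The helper name is_dict_or_list is hypothetical. One practical difference is that isinstance also accepts subclasses such as OrderedDict, which the type(...) is comparison rejects.

```python
from collections import OrderedDict


def is_dict_or_list(config):
    """Return True for dicts, lists, and their subclasses."""
    return isinstance(config, (dict, list))


ordered = OrderedDict(a=1)
print(is_dict_or_list(ordered))   # subclass of dict, accepted
print(type(ordered) is dict)      # exact-type comparison rejects it
```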

robertnishihara · 2018-07-24T07:09:06Z

modin/experimental/scripts/configure_ray_cluster.sh

@@ -0,0 +1,52 @@
+#!/usr/bin/sh


I've made some scripts like this in the past, and in the end I've always had to convert them from bash -> Python. Happy to share examples.

Would love to take a look!
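
As a rough idea of what a bash -> Python conversion could look like, the sketch below rebuilds the ssh invocation quoted earlier in this review using subprocess. The function names, the host and key values, and the choice to feed the remote script over stdin are assumptions for illustration; this is not the script from the PR or from Ray.

```python
import subprocess


def build_ssh_command(host, key_path):
    """Build an ssh command that skips the host key prompt."""
    return ["ssh", "-i", key_path, "-o", "StrictHostKeyChecking=no", host]


def run_remote_script(host, key_path, script):
    """Send a shell script to the remote host over ssh's stdin."""
    # check=True raises CalledProcessError if ssh exits with a non-zero status.
    return subprocess.run(
        build_ssh_command(host, key_path),
        input=script,
        text=True,
        check=True,
    )


# Only builds the command here; nothing is executed.
print(build_ssh_command("user@head-node", "~/.ssh/key.pem"))
```

Passing arguments as a list avoids shell quoting issues, and exceptions from subprocess give a natural place to hang the error reporting mentioned under future work.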

robertnishihara · 2018-07-24T07:10:54Z

modin/experimental/scripts/scripts.py

@@ -0,0 +1,44 @@
+from __future__ import absolute_import
+from __future__ import print_function


did you intend to omit from __future__ import division?

robertnishihara · 2018-07-24T07:13:16Z

modin/experimental/scripts/cluster.py

@@ -0,0 +1,146 @@
+import os
+import subprocess
+import yaml


@ericl can you comment on whether it makes sense to support an autoscaler config that takes in a bunch of IP addresses for nodes in an existing cluster?

Yes, that should work

robertnishihara · 2018-07-24T07:13:49Z

modin/experimental/scripts/cluster.py

@@ -0,0 +1,146 @@
+import os


I'd suggest adding the from __future__ import ... lines for consistency

robertnishihara · 2018-07-24T07:14:56Z

modin/pandas/__init__.py

+        if os.environ.get("MODIN_EXECUTION_FRAMEWORK") == "ray" and \
+                os.environ.get("MODIN_RAY_REDIS_ADDRESS"):
+            redis_address = os.environ.get("MODIN_RAY_REDIS_ADDRESS")
+            ray.init(redis_address=redis_address)


I suspect it will simplify users' lives to just call ray.init(redis_address=...) but I could be wrong about that.

@devin-petersohn thoughts? This was what I was doing in an older version of the PR.

pschafhalter · 2018-07-24T08:19:58Z

@robertnishihara agreed, if parts of this PR generalize to generic Ray clusters, I would be happy to merge them upstream. At first, I hoped to take advantage of the autoscaler scripts in order to easily run Pandas on Ray on AWS/GCP clusters, but after some discussion @devin-petersohn thought it might be best to develop a script for manually managed clusters first.

simon-mo · 2018-08-27T23:18:23Z

@pschafhalter Will you be ok if I push this PR forward?

pschafhalter · 2018-08-28T02:57:09Z

@simon-mo sure, although I think it should be redesigned to take advantage of Ray's newly added support for local clusters (ray-project/ray#2678).

AmplabJenkins · 2018-10-04T01:20:38Z

Can one of the admins verify this patch?

devin-petersohn · 2018-10-04T04:46:52Z

Jenkins, add to whitelist.

AmplabJenkins · 2018-10-04T04:50:50Z

Build finished. Test FAILed.

AmplabJenkins · 2018-10-04T04:50:50Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin/12/
Test FAILed.

AmplabJenkins · 2018-10-26T22:48:24Z

Build finished. Test FAILed.

AmplabJenkins · 2018-10-26T22:48:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Modin-Performance-Testing/6/
Test FAILed.

pschafhalter · 2019-05-05T07:06:31Z

Closed since this is out-of-date


pschafhalter added 4 commits July 17, 2018 20:28

Add cluster shell scripts

10b4ebe

Add tools to parse config

9205f0c

Initial working cluster version

2b347a5

formatting

b5afcc9

pschafhalter added the enhancement label Jul 19, 2018

Comment example config

8791adf

devin-petersohn reviewed Jul 19, 2018

View reviewed changes

pschafhalter added 8 commits July 19, 2018 11:23

Address comments

d3f3f6f

Use system python

0e64cac

Add documentation

bfadd37

Fix

a0ac2fe

Use environment variables to configure Modin/Ray

8c5ba14

Rename execution[_ ]engine -> execution[_ ]framework

3de703f

Configure scripts to use environment variables

9995256

Add __init__ for py2

965db6d

pschafhalter changed the title ~~Run Modin on cluster~~ [WIP] Run Modin on cluster Jul 20, 2018

Make setting up cluster more robust

36f28a7

pschafhalter changed the title ~~[WIP] Run Modin on cluster~~ Run Modin on cluster Jul 20, 2018

pschafhalter added 3 commits July 20, 2018 16:18

Move to experimental

97a0110

Fix entry point

1ec3aab

Set default port

9ae2f71

robertnishihara reviewed Jul 24, 2018

View reviewed changes

pschafhalter added 2 commits July 24, 2018 01:21

Add future imports

72f9f9f

Use isisntance

6d90b02

devin-petersohn added new feature/request 💬 Requests and pull requests for new features and removed enhancement labels Sep 28, 2018

pschafhalter closed this May 5, 2019

dchigarev pushed a commit to dchigarev/modin that referenced this pull request Aug 25, 2020

Merge pull request modin-project#46 from intel-go/ienkovich/float64

aeec32a

support np.float64 literal

mvashishtha pushed a commit to mvashishtha/modin that referenced this pull request Jan 26, 2023

Fix nlargest and nsmallest Series support (modin-project#46)

22bf6ae

* Fix nlargest and smallest support Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Feb 13, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (core)

ba7ba0b

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Feb 27, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (service)

466d8c7

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (core)

3310fdc

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (service)

fea8b97

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (core)

163cee5

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (service)

8e8697c

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (service)

30826cf

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (core)

6d25e7a

Signed-off-by: Naren Krishna <[email protected]>

vnlitvinov pushed a commit to vnlitvinov/modin that referenced this pull request Mar 16, 2023

Fix nlargest and nsmallest Series support (modin-project#46) (service)

57870d0

Signed-off-by: Naren Krishna <[email protected]>
